TIMIT-TTS: A Text-to-Speech Dataset for Multimodal Synthetic Media Detection

Authors

Abstract

With the rapid development of deep learning techniques, the generation and counterfeiting of multimedia material has become increasingly simple. Current technology enables the creation of videos where both the visual and audio contents are falsified. The forensics community has begun to address this threat by developing fake media detectors. However, the vast majority of existing forensic techniques only analyze one modality at a time. This is an important limitation when authenticating manipulated videos, because sophisticated forgeries may be difficult to detect without exploiting cross-modal inconsistencies (e.g., across the audio and visual tracks). One reason for the lack of multimodal detectors is a similar lack of research datasets containing multimodal forgeries. Existing datasets typically contain only one falsified modality, such as deepfaked videos with authentic audio tracks, or synthetic audio with no associated video. Currently, datasets are needed that can be used to develop, train, and test these algorithms. In this paper, we propose a new audio-visual deepfake dataset containing multimodal video forgeries. We present a general pipeline for synthesizing speech deepfake content from a given video, facilitating the creation of counterfeit multimodal material. The proposed method uses Text-to-Speech (TTS) and Dynamic Time Warping (DTW) techniques to achieve realistic speech tracks. We use this pipeline to generate and release TIMIT-TTS, a synthetic speech dataset containing the most cutting-edge methods in the TTS field. It can be used as a standalone audio dataset, or combined with the DeepfakeTIMIT and VidTIMIT video datasets to perform multimodal research. Finally, we present numerous experiments to benchmark the proposed dataset in both monomodal (i.e., audio) and multimodal (i.e., audio and video) conditions. This highlights the need for multimodal forensic detectors and more multimodal deepfake data.
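
To make the DTW step named in the abstract concrete, the following is a minimal sketch of classic dynamic time warping between two short feature sequences. It is not the authors' pipeline: the function name `dtw_path`, the use of scalar features, and the absolute-difference local cost are all assumptions chosen for illustration. In a TTS setting, the two sequences would typically be per-frame features of the synthetic and the original speech.

```python
def dtw_path(a, b):
    """Align two non-empty sequences of numbers with classic DTW.

    Hypothetical helper for illustration only; returns (total_cost, path),
    where path is a list of (index_in_a, index_in_b) pairs.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j - 1],  # advance in both sequences
                cost[i - 1][j],      # advance in a only
                cost[i][j - 1],      # advance in b only
            )
    # Backtrack from the end to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min(
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        )
    path.reverse()
    return cost[n][m], path


# Toy example: the second sequence holds the value 1 for one extra frame.
total, path = dtw_path([0, 1, 2, 3], [0, 1, 1, 2, 3])
print(total, path)
```

The recovered path maps each frame of one sequence to one or more frames of the other, which is the information a time-stretching step would need to make a synthetic track follow the timing of the original.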

Similar resources

A Taiwanese (min-nan) text-to-speech (TTS) system based on automatically generated synthetic units

A Taiwanese (Min-nan) Text-to-Speech (TTS) system has been constructed in this paper based on automatically generated synthetic units by considering several specific phonetic and linguistic characteristics of Taiwanese. Some basic facts about Taiwanese that are useful in a TTS system are summarized, including the issues of tone sandhi, the written format, and others. Three functional modules, namely a ...

USC-TIMIT: A database of multimodal speech production data

USC-TIMIT is a speech production database under ongoing development, which currently includes real-time magnetic resonance imaging data from five male and five female speakers of American English, and electromagnetic articulography data from five of these speakers. The two modalities were recorded in two independent sessions while the subjects produced the same 460 sentence corpus. In both case...

A Multimodal Dataset for Deception Detection

This paper presents the construction of a multimodal dataset for deception detection, including physiological, thermal, and visual responses of human subjects under three deceptive scenarios. We present the experimental protocol, as well as the data acquisition process. To evaluate the usefulness of the dataset for the task of deception detection, we present a statistical analysis of the physio...

CHULA TTS: A Modularized Text-To-Speech Framework

Spoken and written languages evolve constantly through their everyday usages. Combining with practical expectation for automatically generating synthetic speech suitable for various domains of context, such a reason makes Text-to-Speech (TTS) systems of living languages require characteristics that allow extensible handlers for new language phenomena or customized to the nature of the domains i...

Journal

Journal title: IEEE Access

Year: 2023

ISSN: 2169-3536

DOI: https://doi.org/10.1109/access.2023.3276480